Demo: Intro to Data Analysis with Python

This session covers:

Prerequisites

Install Python & Jupyter Notebooks

The following instructions will take you through the process of installing Python 3 on your local macOS machine via command line. Before you begin, ensure that you are connected to the internet and to a power source. This process may take around 15-30 minutes.

How To Install Jupyter Notebook on Mac and Windows

Install Required Libraries

This analysis will require the following libraries:

If you receive the error "No module named X" when running the Import Libraries code below, open the terminal and install the missing package using the command:

pip3 install X

Quick Tip:
To run a cell, click on the cell to select it.

Then:
- To run the cell: COMMAND + RETURN
- To run the cell and move to the next cell: SHIFT + RETURN

Why Do We Use Python?

Free & Easy to Learn

Fast, Flexible, & Scalable

Collaborative


1. Importing

1.1 Importing Libraries

Libraries contain additional functions that are not part of the core Python code, but can be used to support our analysis. Before we can use the functions within a library we must first install, then import the library into our environment.

It is common to alias a library when importing it, to simplify our code.
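For example, a minimal import cell might look like the following (pandas and NumPy are the conventional community aliases):

```python
# Aliasing keeps later code short: pd.DataFrame instead of pandas.DataFrame
import pandas as pd
import numpy as np

# A quick check that the alias works
df = pd.DataFrame({"a": [1, 2, 3]})
print(df["a"].sum())
```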

(Optional) Update Display Settings

Pandas allows the user to customize its default options and settings.

Show All Columns
When dealing with a dataset that contains many columns, Jupyter will automatically collapse a number of columns in the middle of the DataFrame. We can use the following command to update the environment's configuration to display all DataFrame columns and force horizontal scrolling.

Show # of Decimal Places
We can also update the number of decimal places we would like to display for numeric columns. This can be useful if we have a large number of columns.
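A sketch of both display settings using Pandas options (`None` for `display.max_columns` means "show every column"):

```python
import pandas as pd

# Show every column instead of collapsing wide DataFrames
# (forces horizontal scrolling in Jupyter)
pd.set_option("display.max_columns", None)

# Display numeric columns with 2 decimal places
pd.set_option("display.precision", 2)
```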


1.2 Importing Data

Next, we will load the data we would like to analyze. We will be working with the popular Iris dataset from the UC Irvine Machine Learning Repository, which contains 3 iris species as well as a few properties about each flower. This file can be found in the data folder.

Load Data
We can use the Pandas read_csv method to load a CSV file from a particular file path into a DataFrame object.

Similarly, the read_excel method can be used to read a single sheet or a list of sheets from an Excel file.
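A small sketch of loading a CSV; here an in-memory string stands in for a file on disk, while with a real file you would pass a path instead:

```python
import pandas as pd
from io import StringIO

# Stand-in for a CSV file; with a real file, pass a path such as "data/iris.csv"
csv_data = StringIO("sepal_length,variety\n5.1,Setosa\n7.0,Versicolor\n")
df = pd.read_csv(csv_data)

# The Excel call is analogous; sheet_name selects one sheet or a list of sheets:
# df = pd.read_excel("data/iris.xlsx", sheet_name="Sheet1")
print(df.shape)
```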

View Data
We can use Pandas head to "peek" at the first 5 rows of the DataFrame. It is good practice to use the head() method on our imported data to ensure that 1) it is the correct dataset, and 2) it has been imported correctly.

Similarly, we can use the tail method to view the last 5 rows of the DataFrame.

We can also specify a certain number of rows to display, e.g., df.head(2)

Note: It is also possible to connect to and read data directly from a database.

We can also peek at a single column.
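The viewing methods above can be sketched on a toy DataFrame (column names here are assumed to mirror the Iris data):

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.8, 6.7],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

first_five = df.head()      # first 5 rows
last_five = df.tail()       # last 5 rows
first_two = df.head(2)      # any row count can be passed
one_column = df["variety"]  # a single column is a Series
```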



2. Data Exploration

When working with a new (or even familiar!) dataset, the first step should be data exploration. Data exploration helps us to better understand our data, summarize our dataset, and identify issues like missing or unnatural values.

Python offers a number of methods to help us explore our dataset.

Find DataFrame Dimensions
Accessing the shape property returns a tuple that contains the number of rows and columns present in our DataFrame.
Note: A tuple is an ordered, immutable sequence of values (e.g., an n-tuple contains n values)
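For instance, on a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
rows, cols = df.shape
print(rows, cols)
```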

Summary Statistics
The describe method generates descriptive statistics about the numerical columns in our dataset.

We can also use Pandas and NumPy methods to generate individual statistics.
Examples include: mean(), median(), min(), max(), sum(), std()
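A sketch of both approaches on a toy numeric column:

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.0, 6.0, 7.0]})

# describe() returns count, mean, std, min, quartiles, and max per numeric column
summary = df.describe()

# Individual statistics on a single column
mean_len = df["sepal_length"].mean()
max_len = df["sepal_length"].max()
```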

Info
The info method provides a summary of all columns, data types, and the number of Null (empty) values. This is useful if your data also contains non-numeric columns.

Column List
If we have a dataset containing many columns, it can be helpful to view the entire list of columns.
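Both summaries can be sketched as follows (info prints its report rather than returning it):

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "variety": ["Setosa", "Versicolor"],
})

df.info()                        # dtypes, non-null counts, memory usage
column_names = list(df.columns)  # the full list of column labels
print(column_names)
```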

Find Duplicate Entries
Use the duplicated method to check if an exact entry appears in our dataset multiple times. Based on the goals of our analysis, we may want to either keep or remove duplicate rows from our data.
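A sketch: duplicated flags each repeat of an earlier row as True, leaving the first occurrence False.

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Setosa", "Virginica"],
    "sepal_length": [5.1, 5.1, 6.3],
})

dupes = df.duplicated()   # True only for exact repeats of an earlier row
print(dupes.sum())        # number of duplicate rows
```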

Unique Values
The unique method returns the unique values of a series object. A series is a 1D array with axis labels. A single DataFrame column is a series.
If our data has categorical columns, it can be useful to retrieve a list of the unique values of a particular category.

Value Counts
We can use the value_counts method to determine the number of times each unique value appears in a Series.
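Both Series methods can be sketched on a toy categorical column:

```python
import pandas as pd

s = pd.Series(["Setosa", "Setosa", "Versicolor", "Virginica"])

unique_values = s.unique()   # each value once, in order of first appearance
counts = s.value_counts()    # number of occurrences of each value
```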

Count Missing Values
Count the number of missing values in each column. In this case, we are not missing any data.

Likewise, we can count the number of non-empty values in each column!
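A sketch of both counts on a toy DataFrame with one missing value:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3],   # None becomes NaN (missing)
    "variety": ["Setosa", "Versicolor", "Virginica"],
})

missing = df.isnull().sum()   # missing values per column
present = df.count()          # non-empty values per column
```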


Pandas Profiling

Pandas Profiling is an extremely useful package that generates an interactive, all-in-one report on our dataset.

Note: This requires the installation of this additional library:

pip3 install pandas-profiling

Quick Tip:
If a cell produces a lengthy output, you can expand or collapse it by clicking on the left hand side of the output window.


3. Data Visualization

Matplotlib and Seaborn are two of the most popular libraries for visualization in Python. While Matplotlib is useful for plotting basic charts, Seaborn can create "prettier" charts with less syntax. Plotly can be used to make interactive plots.

The following resources provide a detailed overview of common plotting techniques and how to apply them to the Iris dataset.

3.1. Basic Plots

Scatter Plots

Box Plots
Using Seaborn.

Histograms
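The three basic plot types above can be sketched with Matplotlib as follows (Seaborn offers the same charts with less syntax, e.g. sns.scatterplot, sns.boxplot, and sns.histplot); the toy data and figure filename here are placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 5.8],
    "sepal_width": [3.5, 3.2, 3.3, 2.7],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(df["sepal_length"], df["sepal_width"])  # scatter plot
axes[0].set_title("Scatter")
axes[1].boxplot(df["sepal_length"])                     # box plot
axes[1].set_title("Box")
axes[2].hist(df["sepal_length"], bins=4)                # histogram
axes[2].set_title("Histogram")
fig.savefig("basic_plots.png")
```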

3.2. Interactive Plots

Use the Plotly package to create interactive plots. Try hovering over a point on the plots below.



4. Data Manipulation

Those familiar with SQL may notice some similarities in pandas. This resource provides examples of how to perform various SQL-like operations in Python with pandas.

4.1. Filtering

Filtering allows us to create a subset of our data that meets a certain condition or list of conditions. We can create filters using Boolean expressions, which assign True/False values based on whether the conditions are met.

This is similar to conditional WHERE or HAVING statements in SQL.

4.1.1. Simple Expressions

Equality Statements
We can use a boolean equality statement to identify the DataFrame rows where the iris variety is Versicolor. This expression returns a vector of True/False values that correspond to each row in our DataFrame.

We can use this expression to filter our DataFrame and retrieve the subset of rows where the boolean expression is True. The following command creates a new DataFrame that contains the subset of rows that met the filter condition.
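Both steps can be sketched on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Versicolor"],
    "sepal_length": [5.1, 7.0, 6.4],
})

mask = df["variety"] == "Versicolor"  # boolean Series, one True/False per row
versicolor = df[mask]                 # new DataFrame with only matching rows
```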

Not Equal Statements
Filter where DataFrame rows are not equal to some value.
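For example, keeping every row whose variety is not Setosa:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Versicolor"],
    "sepal_length": [5.1, 7.0, 6.4],
})

not_setosa = df[df["variety"] != "Setosa"]  # rows failing the equality
```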

In List
Filter DataFrame rows whose value is present in a list.
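A sketch using the Series isin method:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Versicolor"],
    "sepal_length": [5.1, 7.0, 6.4],
})

# Keep rows whose variety appears in the list
subset = df[df["variety"].isin(["Setosa", "Virginica"])]
```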

Null Values
Filter DataFrame rows where a value is Null.

Note: Only remove Null values if it will not negatively affect the results of your analysis!

Things to Consider When Dropping Data:

Filter DataFrame rows where a value is not null. This is useful if we would like to remove rows with empty or missing values.
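Both null filters can be sketched together:

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, None, 6.3]})  # None becomes NaN

null_rows = df[df["sepal_length"].isnull()]      # rows with missing values
complete_rows = df[df["sepal_length"].notnull()] # rows with values present
```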

4.1.2. Complex Expressions

We can combine boolean expressions to create more complex filters. Each condition MUST be contained within its own set of parentheses.

AND Statements
Filter where DataFrame rows meet both conditions.

OR Statements
Filter where DataFrame rows meet one of the conditions.

Long Expressions
If we have a long expression, we can define a new variable to hold the filter condition and then pass it to our DataFrame.
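The AND, OR, and named-variable patterns can be sketched as follows (note the required parentheses around each condition):

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Versicolor"],
    "sepal_length": [5.1, 7.0, 6.4],
})

# AND: both conditions must hold
both = df[(df["variety"] == "Versicolor") & (df["sepal_length"] > 6.5)]

# OR: either condition may hold
either = df[(df["variety"] == "Setosa") | (df["sepal_length"] > 6.5)]

# A long expression stored in a variable, then passed to the DataFrame
condition = (df["sepal_length"] > 5.0) & (df["sepal_length"] < 7.5)
in_range = df[condition]
```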

4.2. Sorting

The sort_values method is used to sort a DataFrame in either ascending or descending order based on the values of one or more columns.

Sort by multiple columns.
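Both sorts can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Versicolor", "Setosa", "Versicolor"],
    "sepal_length": [7.0, 5.1, 6.4],
})

# Single column, largest first
by_length = df.sort_values("sepal_length", ascending=False)

# Multiple columns: sort by variety, then by length within each variety
by_both = df.sort_values(["variety", "sepal_length"])
```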

4.3. Grouping

The groupby method allows us to group data based on values, then either summarize or apply operations to the grouped data.

Use pd.NamedAgg for column specific aggregation.
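A sketch of both patterns:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Versicolor"],
    "sepal_length": [5.1, 7.0, 6.4],
})

# Mean sepal length per variety
means = df.groupby("variety")["sepal_length"].mean()

# Column-specific aggregation with named output columns
stats = df.groupby("variety").agg(
    min_len=pd.NamedAgg(column="sepal_length", aggfunc="min"),
    max_len=pd.NamedAgg(column="sepal_length", aggfunc="max"),
)
```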

4.4. Adding, Dropping, & Retrieving

Adding Columns
We can create new columns by performing an operation on an existing column.

Or, we can create new columns as combinations of the existing columns.

Assign column values based on whether they meet some condition.
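All three patterns can be sketched together (the petal-area formula, length times width, is an assumption for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "petal_length": [1.4, 4.7],
    "petal_width": [0.2, 1.4],
})

# Operation on one column
df["sepal_length_mm"] = df["sepal_length"] * 10

# Combination of existing columns (area assumed as length * width)
df["petal_area"] = df["petal_length"] * df["petal_width"]

# Conditional assignment
df["size_class"] = np.where(df["sepal_length"] > 6, "long", "short")
```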

Dropping Columns
We can also drop unnecessary columns from our DataFrame.
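For example, drop returns a copy without the named columns, leaving the original DataFrame untouched:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
})

trimmed = df.drop(columns=["sepal_width"])
```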

Dropping Duplicate Rows
The drop_duplicates method can be used to return a DataFrame with duplicate rows removed.
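A sketch:

```python
import pandas as pd

df = pd.DataFrame({"variety": ["Setosa", "Setosa", "Virginica"]})

deduped = df.drop_duplicates()  # keeps the first occurrence of each row
```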

Retrieve Column Subset
We can retrieve a subset of columns from our DataFrame.
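Passing a list of column names returns a DataFrame with only those columns:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "variety": ["Setosa", "Versicolor"],
})

subset = df[["sepal_length", "variety"]]  # note the double brackets
```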

Retrieve Rows

The .loc attribute retrieves the row(s) at a provided DataFrame index label.
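A sketch; note that .loc slices by label and include both endpoints:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Virginica"],
    "sepal_length": [5.1, 7.0, 6.3],
})

row = df.loc[0]     # the row whose index label is 0
rows = df.loc[0:1]  # label-based slices include both endpoints
```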

4.5. Merging

Similar to a SQL join, merging allows us to combine DataFrames. This resource provides a guide to replicate various SQL join operations in Python.
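A sketch of an inner join on a shared key column (the second table and its color values are hypothetical):

```python
import pandas as pd

measurements = pd.DataFrame({
    "variety": ["Setosa", "Versicolor"],
    "sepal_length": [5.1, 7.0],
})
colors = pd.DataFrame({
    "variety": ["Setosa", "Versicolor"],
    "color": ["blue", "purple"],  # hypothetical attribute
})

# Like SQL: SELECT ... FROM measurements INNER JOIN colors USING (variety)
merged = measurements.merge(colors, on="variety", how="inner")
```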


Insights

What kind of information can we learn about our dataset?

Here are a few examples!

Goal 1: For each iris variety, how many entries have a sepal length of greater than 5.5?

Goal 2: What is the total sum of Versicolor petal areas?

Goal 3: What is the minimum sepal width for each iris variety?

Goal 4: On average, which iris variety has the smallest petal area?

Goal 5: Normalize a column to a common scale.
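Hedged sketches of all five goals on a toy stand-in for the dataset (the petal_area column is assumed to have been created earlier, as in section 4.4; the values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Versicolor", "Versicolor", "Virginica"],
    "sepal_length": [5.1, 7.0, 5.6, 6.3],
    "sepal_width": [3.5, 3.2, 2.5, 3.3],
    "petal_area": [0.28, 6.58, 4.55, 11.78],
})

# Goal 1: per-variety count of rows with sepal length > 5.5
g1 = df[df["sepal_length"] > 5.5].groupby("variety").size()

# Goal 2: total Versicolor petal area
g2 = df.loc[df["variety"] == "Versicolor", "petal_area"].sum()

# Goal 3: minimum sepal width per variety
g3 = df.groupby("variety")["sepal_width"].min()

# Goal 4: variety with the smallest average petal area
g4 = df.groupby("variety")["petal_area"].mean().idxmin()

# Goal 5: min-max normalize a column to the [0, 1] range
col = df["sepal_length"]
df["sepal_length_norm"] = (col - col.min()) / (col.max() - col.min())
```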